Semiconductor manufacturing process
A modern semiconductor manufacturing process is under constant surveillance through signals/variables collected from sensors and process measurement points. However, not all of these signals are equally valuable in a specific monitoring system: the measurements contain a mix of useful information, irrelevant information and noise, and engineers typically collect far more signals than are actually required. If each type of signal is treated as a feature, feature selection can be applied to identify the most relevant signals. Process engineers can then use these signals to determine the key factors contributing to yield excursions downstream in the process, which increases process throughput, shortens time to learning and reduces per-unit production cost. The selected signals can serve as features to predict the yield type, and by analysing different combinations of features, the essential signals impacting the yield type can be identified.
signal-data.csv : (1567, 592). The data consist of 1567 datapoints, each with 591 features. Each example represents a single production entity with its associated measured features, and the label is a simple pass/fail yield from in-house line testing. In the target column "Pass/Fail", -1 corresponds to a pass and 1 corresponds to a fail, and the timestamp marks that specific test point.
We will build a classifier to predict the Pass/Fail yield of a particular process entity and analyse whether all the features are required to build the model or not.
import pandas as pd
import numpy as np
from sklearn.impute import SimpleImputer
import matplotlib.pyplot as plt
import seaborn as sns
from imblearn.over_sampling import SMOTE
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn import svm
from sklearn.metrics import accuracy_score,confusion_matrix,recall_score,classification_report,roc_auc_score,roc_curve
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.ensemble import AdaBoostClassifier
from sklearn.ensemble import GradientBoostingClassifier
from xgboost import XGBClassifier
from sklearn.model_selection import StratifiedKFold
import pickle
#A. Import 'signal-data.csv' as a DataFrame.
print("----Signal data----\n")
Signaldf=pd.read_csv('signal-data.csv')
print("Shape",Signaldf.shape)
Signaldf.head()
#B. Print 5 point summary and share at least 2 observations
Signaldf.describe(include='all')
Signaldf.info()
Observations
#A. Write a for loop which will remove all the features with 20%+ Null values and impute rest with mean of the feature.
featurewithnull = 0
feature_morethan20per_null = 0
removedfeature = []
for feature in Signaldf.columns:
    # if data is missing in this feature
    if Signaldf[feature].isnull().sum():
        featurewithnull += 1
        missing_per = round((Signaldf[feature].isnull().sum() / len(Signaldf[feature])) * 100, 2)
        # missing percentage >= 20: drop the feature
        if missing_per >= 20:
            feature_morethan20per_null += 1
            Signaldf = Signaldf.drop(feature, axis=1)
            removedfeature.append(feature)
        # missing percentage < 20: impute with the feature mean
        else:
            Signaldf[feature].fillna(round(Signaldf[feature].mean(), 4), inplace=True)
print("Number of features with null values:", featurewithnull)
print("Number of features with more than 20 percent null values:", feature_morethan20per_null)
print("Shape after missing value treatment ",Signaldf.shape)
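The same 20%-null filter and mean imputation can also be expressed without an explicit loop, using column-wise pandas operations. A minimal sketch on a hypothetical toy frame standing in for Signaldf:

```python
import numpy as np
import pandas as pd

# hypothetical toy frame standing in for Signaldf
df = pd.DataFrame({
    "a": [1.0, np.nan, 3, 4, 5, 6, 7, 8, 9, 10],            # 10% missing -> keep and impute
    "b": [np.nan, np.nan, np.nan, 1.0, 2, 3, 4, 5, 6, 7],   # 30% missing -> drop
    "c": list(range(10)),                                    # complete
})

# keep only columns with less than 20% missing values
df = df.loc[:, df.isna().mean() < 0.20]
# impute the remaining nulls with each column's mean
df = df.fillna(df.mean())
```

The boolean mask `df.isna().mean() < 0.20` computes the missing fraction per column in one pass, avoiding the per-feature bookkeeping of the loop.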
#B. Identify and drop the features which are having same value for all the rows.
count = 0
for feature in Signaldf.columns[1:]:
    if Signaldf[feature].std() == 0:
        Signaldf.drop([feature], axis=1, inplace=True)
        removedfeature.append(feature)
        count += 1
print("No. of columns whose standard deviation was 0 and hence dropped:",count)
print("Shape after Missing value treatment and removing feature with same value for all rows",Signaldf.shape)
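Equivalently, constant columns can be found in one shot with `nunique()` instead of checking the standard deviation per column (a sketch on a toy frame; note `nunique` also works for non-numeric columns):

```python
import pandas as pd

df = pd.DataFrame({"x": [7, 7, 7], "y": [1, 2, 3]})
# a column with a single distinct value carries no information
constant_cols = df.columns[df.nunique() == 1]
df = df.drop(columns=constant_cols)
```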
#C. Drop other features if required using relevant functional knowledge. Clearly justify the same.
#The Time feature has no predictive value for the target
Signaldf.drop(['Time'],axis=1,inplace=True)
removedfeature.append('Time')
print ("Time feature is removed")
#Low variance filter
col = Signaldf.columns.values
from sklearn.feature_selection import VarianceThreshold
sel = VarianceThreshold(threshold=0.2)
sel.fit_transform(Signaldf)
features = col[(sel.get_support(indices=True))]
#print(sel.get_support())
#print(features)
Signaldf = Signaldf.filter(features)
#print(Signaldf.columns)
print("No. of columns with less variance and hence dropped:",len(col)-len(features))
The Time feature has no predictive value for the target.
A feature with low variance is unlikely to be useful for classifying the target. In the previous step, zero-variance features (features holding the same constant value in every row) were already removed.
We further remove features with variance below 0.2, reducing the dimensionality.
print("Shape after low variance filter",Signaldf.shape)
#D. Check for multi-collinearity in the data and take necessary action
corr_matrix = Signaldf.corr()
iters = range(len(corr_matrix.columns) - 1)
drop_cols = []
# Iterate through the correlation matrix and compare correlations
for i in iters:
    for j in range(i + 1):
        item = corr_matrix.iloc[j:(j + 1), (i + 1):(i + 2)]
        col = item.columns
        row = item.index
        val = abs(item.values)
        # if the correlation exceeds the threshold
        if val >= 0.70:
            # print the correlated pair and the correlation value
            print(col.values[0], "X", row.values[0], "=", round(val[0][0], 2))
            drop_cols.append(col.values[0])
drops = set(drop_cols)
Signaldf = Signaldf.drop(columns=drops)
print("No. of columns that are highly correlated and hence dropped:", len(drops))
print("Shape after multicollinearity feature removal", Signaldf.shape)
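The nested loop above can be replaced by a vectorized scan of the upper triangle of the correlation matrix. A minimal sketch with the same 0.70 threshold on a hypothetical toy frame:

```python
import numpy as np
import pandas as pd

# hypothetical toy frame standing in for Signaldf
df = pd.DataFrame({
    "a": [1, 2, 3, 4, 5],
    "b": [2, 4, 6, 8, 10],   # perfectly correlated with "a"
    "c": [5, 3, 8, 1, 4],
})

corr = df.corr().abs()
# keep only the strict upper triangle so each pair is inspected once
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = [col for col in upper.columns if (upper[col] >= 0.70).any()]
df = df.drop(columns=to_drop)
```

Masking with `np.triu(..., k=1)` ensures that for each correlated pair only the later column is flagged, mirroring the loop's behaviour.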
#E. Make all relevant modifications on the data using both functional/logical reasoning/assumption
#Analysing data
Signaldf.describe(include='all').T
Features [419, 499, 500, 511, 521]: at least 50% of the values are 0. Action 1: remove those features.
Outliers are observed in many features. Action 2: impute the outliers with the lower and upper whiskers accordingly.
Target values are -1 and 1. Action 3: recode the pass/fail labels to 0/1 (fail → 0, pass → 1) for easier interpretation.
#Action 1
count = 0
for feature in Signaldf.columns[:-1]:
    if np.percentile(Signaldf[feature], 50) == 0:
        Signaldf.drop([feature], axis=1, inplace=True)
        removedfeature.append(feature)
        print(feature)
        count += 1
print("No. of columns whose median is 0 (at least 50% zeros) and hence dropped:", count)
print("Shape after removing majority of zero feature",Signaldf.shape)
#Action 2
def outlierdetection(df):
    print(" \n------ Outlier Detection ------\n")
    Q1 = df.quantile(0.25)
    Q3 = df.quantile(0.75)
    IQR = Q3 - Q1
    outliers = np.where((df < (Q1 - 1.5 * IQR)) | (df > (Q3 + 1.5 * IQR)))
    if len(outliers[0]) == 0:
        print("No Outliers Found")
    else:
        print(outliers)
    outlierhandled = outliertreatment(df, Q1, Q3, IQR)
    return outlierhandled
#Outlier treatment
#Outlier treatment
def outliertreatment(df, Q1, Q3, IQR):
    lower = Q1 - 1.5 * IQR
    upper = Q3 + 1.5 * IQR
    # replace every outlier on the lower side by the lower whisker
    for i, j in zip(*np.where(df < lower)):
        df.iloc[i, j] = lower.iloc[j]
    # replace every outlier on the upper side by the upper whisker
    for i, j in zip(*np.where(df > upper)):
        df.iloc[i, j] = upper.iloc[j]
    return df
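The whisker replacement above is a winsorization, which pandas can do in a single call with `DataFrame.clip` using the same IQR whiskers. A sketch on a toy column:

```python
import pandas as pd

df = pd.DataFrame({"f": [1.0, 2.0, 3.0, 4.0, 100.0]})  # 100 is an outlier

q1, q3 = df.quantile(0.25), df.quantile(0.75)
iqr = q3 - q1
# cap every value at the lower/upper whisker, column-wise
df = df.clip(lower=q1 - 1.5 * iqr, upper=q3 + 1.5 * iqr, axis=1)
```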
Signaldf2=Signaldf.copy()
Signaldf2.iloc[:,:-1]=outlierdetection(Signaldf.iloc[:,:-1])
print("Shape after data pre-processing", Signaldf2.shape)
#target balance
Signaldf2['Pass/Fail'].value_counts()
#Action 3
Signaldf2['Pass/Fail']=Signaldf2['Pass/Fail'].replace(1,0).astype('int64')
Signaldf2['Pass/Fail']=Signaldf2['Pass/Fail'].replace(-1,1).astype('int64')
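The two chained replace calls only work because fail (1) is recoded to 0 before pass (-1) is recoded to 1; a single `map` does the recoding in one pass and avoids that order dependence. A sketch on a toy Series:

```python
import pandas as pd

s = pd.Series([-1, 1, -1, -1, 1])
# recode in one pass: -1 (pass) -> 1, 1 (fail) -> 0
s = s.map({-1: 1, 1: 0}).astype("int64")
```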
#A. Perform a detailed univariate Analysis with appropriate detailed comments after each analysis.
fig, axes = plt.subplots(18, 7, figsize=(20, 90))
i = 0
for c in Signaldf.columns[:-1]:  # find the axes and plot the distribution
    ax = axes[i // 7, i % 7]
    sns.histplot(Signaldf[c], bins=5, kde=True, ax=ax)  # distplot is deprecated in recent seaborn
    i = i + 1
Approximately normally distributed features: 0, 1, 2, 6, 14, 23, 90, 115.
Features with median less than mean, i.e. right-skewed: 3,12,15,16,32,33,63,117,132,145,151,155,159,160,161,162,167,182,185,200,201,250,223,426,429,432,438,439,467,476,520,525,572
Features with median greater than mean, i.e. left-skewed: 27,570
Features where the majority of the data falls in one bin and the rest are outliers: 4,15,16,32,33,63,67,135,142,151,155,167,185,200,201,223,250,429,438,439,467,420,523,550,572,585
Feature 4: the 75th percentile is 1.5 while the maximum is 1114, so there is an outlier.
Feature 24: normally distributed with a dip in the middle.
fig, axes = plt.subplots(18, 7, figsize=(20, 60))
i = 0
for c in Signaldf.columns[:-1]:
    ax = axes[i // 7, i % 7]
    sns.boxplot(x=Signaldf[c], ax=ax)
    i = i + 1
Features without outliers: 51,136,418,419,482,486,488,539
Upper-fence outliers: 14,32,43,67,117,122,134,135,137,139,142,150,151,155,159,160,161,162,166,167,182,183,185,188,201,208,218,223,225,250,269,417,423,426,429,432,433,438,439,442,453,460,467,468,472,476,483,484,485,487,489,491,493,494,496,510,520,523,525,526,527,545,547,548,549,561,572,585
Lower-fence outliers: 27,28,40,133
Both upper-fence and lower-fence outliers: 0,1,2,3,6,12,15,16,18,21,23,24,33,41,45,6,71,83,88,90,115,129,138,180,200,416,550,562,564,569,570
Below, the box plots are re-drawn after imputing outliers with the lower and upper whiskers accordingly.
fig, axes = plt.subplots(18, 7, figsize=(20, 65))
i = 0
for c in Signaldf2.columns[:-1]:  # for every remaining column, find the axes and plot the box plot
    ax = axes[i // 7, i % 7]
    sns.boxplot(x=Signaldf2[c], ax=ax)
    i = i + 1
Outliers are handled
Normally distributed features: 0, 1, 2, 6, 14, 23, 90, 115.
Right skew (median < mean): 3,12,15,16,32,33,63,117,132,145,151,155,159,160,161,162,167,182,185,200,201,250,223,426,429,432,438,439,467,476,520,525,572
Left skew (median > mean): 27,570
#B. Perform bivariate and multivariate analysis with appropriate detailed comments after each analysis.
Target data analysis
labels = ['Pass', 'Fail']
size = Signaldf['Pass/Fail'].value_counts()
colors = ['green', 'red']
explode = [0, 0.1]
#plt.style.use('seaborn-deep')
plt.rcParams['figure.figsize'] = (4, 4)
plt.pie(size, labels =labels, colors = colors, explode = explode, autopct = "%.2f%%", shadow = True)
plt.axis('off')
plt.title('Target: Pass or Fail', fontsize = 20)
plt.legend()
plt.show()
Signaldf['Pass/Fail'].value_counts().plot(kind="bar");
It is observed that the target is imbalanced and has to be balanced, e.g. with SMOTE.
Target pass is 93.36% of the data, whereas target fail is only 6.64%.
plt.figure(figsize=(15,8))
ax=sns.stripplot(y="12", x="18", hue='Pass/Fail' ,data=Signaldf2)
plt.title("12 vs 18 distinguished by Pass/Fail")
plt.show()
Features 12 and 18 are positively correlated, and pass and fail are distributed across the data.
plt.figure(figsize=(15,8))
sns.swarmplot(x="0", y="1", hue='Pass/Fail' , data=Signaldf2)
plt.title("0 vs 1 distinguished by Pass/Fail")
plt.show()
There appears to be no correlation between these two neighbouring features [0, 1].
sns.pairplot(Signaldf2.iloc[:,0:16]);
Since multicollinear features were removed with a threshold of 0.70, most of the features are weakly correlated; only a few positive correlations are observed (features 12 and 18 are positively correlated).
sns.pairplot(Signaldf2.iloc[:,16:31]);
Since multicollinear features were removed with a threshold of 0.70, most of the features are weakly correlated; only a few positive correlations are observed (features 45 and 62 are positively correlated).
mask = np.triu(np.ones_like(Signaldf2.corr()))
plt.figure(figsize=(50,50))
sns.heatmap(Signaldf2.corr(),mask=mask,cmap="magma");
From the colour pattern, it is observed that very few features are correlated.
Reason: multicollinear features were removed with a threshold of 0.70.
#A. Segregate predictors vs target attributes.
x=Signaldf2.drop("Pass/Fail" , axis=1)
y =Signaldf2["Pass/Fail"]
#B. Check for target balancing and fix it if found imbalanced
Signaldf2['Pass/Fail'].value_counts().plot(kind="bar");
plt.show()
def balancingtarget(X, Y):
    print("\n------ Balancing the data with SMOTE ------\n")
    sm = SMOTE(random_state=40, sampling_strategy='all')
    X_res, Y_res = sm.fit_resample(X, Y)
    # before oversampling
    unique, counts = np.unique(Y, return_counts=True)
    print("Before Sampling\n", np.asarray((unique, counts)).T)
    # after oversampling
    unique, counts = np.unique(Y_res, return_counts=True)
    print("After Sampling\n", np.asarray((unique, counts)).T)
    return X_res, Y_res
X_res,Y_res=balancingtarget(x,y)
Before standardization, the data need to be split into train and test sets; otherwise data leakage may occur.
def Train_test_split(x, y):
    print("\n------ Train test split ------\n")
    X_train, X_test, Y_train, Y_test = train_test_split(x, y, test_size=0.20, random_state=12)
    return X_train, X_test, Y_train, Y_test
X_train, X_test, Y_train, Y_test=Train_test_split(X_res, Y_res)
print("Xtrain shape :",X_train.shape)
print("Xtest shape :",X_test.shape)
print("Ytrain shape :",Y_train.shape)
print("Ytest shape :",Y_test.shape)
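One caveat worth flagging: SMOTE is applied here before the split, so synthetic minority points derived from what later becomes test data can leak into training. A leakage-safe sketch of the order, using plain random oversampling as a stand-in for SMOTE (with imblearn, the same idea is to call fit_resample on the training fold only):

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(12)
X = rng.normal(size=(100, 3))
y = np.array([0] * 90 + [1] * 10)          # imbalanced toy target

# 1. split first, so the test fold stays untouched
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=12)

# 2. oversample the minority class on the TRAINING fold only
minority = np.flatnonzero(y_tr == 1)
need = (y_tr == 0).sum() - minority.size
extra = rng.choice(minority, size=need, replace=True)
X_bal = np.vstack([X_tr, X_tr[extra]])
y_bal = np.concatenate([y_tr, y_tr[extra]])
```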
#Standardization
from scipy.stats import zscore
from sklearn.preprocessing import StandardScaler
def standardize(X_train, X_test, Y_train, Y_test):
    scaler = StandardScaler()
    # fit the scaler on the training data only, then transform both splits
    X_train_std = scaler.fit_transform(X_train)
    X_test_std = scaler.transform(X_test)
    Y_train_std = Y_train.values
    Y_test_std = Y_test.values
    return X_train_std, X_test_std, Y_train_std, Y_test_std
X_train_std, X_test_std, Y_train_std, Y_test_std=standardize(X_train, X_test, Y_train, Y_test)
#D. Check if the train and test data have similar statistical characteristics when compared with original data.
X_train.describe()
X_test.describe()
Signaldf2.describe()
Mean, median and mode of the original, train and test data are in the same range. To confirm this, we perform hypothesis testing.
Let H0: the means of the train and test data are the same; Ha: the means of the train and test data are not the same.
Z-test to check the statistical similarity:
from statsmodels.stats import weightstats as stests
ztest, pval = stests.ztest(X_test['0'], x2=X_train['0'], value=0, alternative='two-sided')
print("P-value for train vs. test data", pval)
if pval < 0.05:
    print("Reject the null hypothesis: the mean of train and test data for feature 0 is not the same")
else:
    print("Fail to reject the null hypothesis: the mean of train and test data for feature 0 is the same")
#ANOVA
#let my H0 = Mean of original, train and test data are same
#Ha= Mean of original, train and test data are not same
from scipy.stats import f_oneway
ztest, pval = f_oneway(Signaldf2['2'], X_train['2'], X_test['2'])
print("Feature 2, P-value", pval)
if pval < 0.05:
    print("Reject the null hypothesis: the means of original, train and test data for feature 2 are not the same")
else:
    print("Fail to reject the null hypothesis: the means of original, train and test data for feature 2 are the same")
From the Z-test and ANOVA, it is observed that the selected features have similar characteristics across the original data, the train split and the test split.
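Beyond comparing means, a two-sample Kolmogorov-Smirnov test checks whether the whole distributions of a feature match between splits. A minimal sketch on synthetic data (the variable names standing in for one feature's train/test values are illustrative):

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)
train_feat = rng.normal(0, 1, 500)  # stand-in for a feature's training values
test_feat = rng.normal(0, 1, 500)   # stand-in for the same feature's test values

# H0: both samples come from the same distribution
stat, pval = ks_2samp(train_feat, test_feat)
# a large p-value means we fail to reject H0
```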
#A. Use any Supervised Learning technique to train a model.
from sklearn.ensemble import RandomForestClassifier
rfc = RandomForestClassifier(n_estimators = 100, random_state=10)
rfc = rfc.fit(X_train_std, Y_train_std)
y_predict = rfc.predict(X_test_std)
print("Training score",rfc.score(X_train_std , Y_train_std))
print("Test Score",rfc.score(X_test_std , Y_test_std))
# Classification Report
print('\n{}'.format(classification_report(Y_test_std, y_predict)))
# Confusion Matrix
cm = confusion_matrix(Y_test_std, y_predict)
print('\nConfusion Matrix:\n', cm)
df_cm = pd.DataFrame(cm, index=[0, 1], columns=[0, 1])  # labels were recoded to 0/1
plt.figure(figsize = (7,5))
sns.heatmap(df_cm, annot=True ,fmt='g')
plt.show()
# Accuracy Score
acc = accuracy_score(Y_test_std, y_predict)
print('\nAccuracy Score:\n', round(acc, 3))
#B. Use cross validation techniques. Hint: Use all CV techniques
#CROSS VALIDATION WITH K FOLD
from sklearn.model_selection import KFold, cross_val_score
num_folds = 10
seed = 7
kfold = KFold(n_splits=num_folds)
results = cross_val_score(rfc, x, y, cv=kfold)
print(results)
print("Accuracy: %.3f%% (%.3f%%)" % (results.mean()*100.0, results.std()*100.0))
#startified k fold
num_folds = 10
stratifiedkfold = StratifiedKFold(n_splits=num_folds)
results = cross_val_score(rfc, x, y, cv=stratifiedkfold)
print(results)
print("Accuracy: %.3f%% (%.3f%%)" % (results.mean()*100.0, results.std()*100.0))
#LOOCV TECHNIQUE
from sklearn.model_selection import LeaveOneOut, cross_val_score
scores = cross_val_score(rfc,x,y, cv=LeaveOneOut())
print(scores.mean())
print(scores.std())
#Bootstrapping
result=[]
from sklearn.utils import resample
for k in range(10):
    x1, y1 = resample(X_train, Y_train)
    rfc.fit(x1, y1)
    y_pred = rfc.predict(X_test)
    result.append(accuracy_score(Y_test, y_pred))
print(np.array(result).mean())
print(np.array(result).std())
| Cross validation technique | Accuracy | Standard Deviation |
|---|---|---|
| K- Fold | 93.3% | 4.5% |
| Stratified K- Fold | 93.1% | 0.735 |
| LOOCV | 93.3% | 0.24 |
| Bootstrapping | 97% | 0.004% |
All the cross-validation techniques give good results (93-97% accuracy), but the deviation in scores is high for K-fold and LOOCV.
Bootstrapping and stratified K-fold have less deviation; bootstrapping yields the best result, with 97% accuracy and very little deviation.
#C. Apply hyper-parameter tuning techniques to get the best accuracy. Suggestion: Use all possible hyper parameter combinations to extract the best accuracies.
##HYPER PARAMETER TUNING WITH RANDOM SEARCH CROSS VALIDATION
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import RandomizedSearchCV
from sklearn.ensemble import RandomForestClassifier
from scipy.stats import randint as sp_randint
clf = RandomForestClassifier(n_estimators=50)
param_dist = {"max_depth": [3, None],
"max_features": sp_randint(2, 11),
"min_samples_split": sp_randint(2, 11),
"min_samples_leaf": sp_randint(2, 11),
"bootstrap": [True, False],
"criterion": ["gini", "entropy"]}
samples = 10 # number of random samples
randomCV = RandomizedSearchCV(clf, param_distributions=param_dist, n_iter=samples)  # cv defaults to 5-fold in recent scikit-learn
randomCV.fit(X_train_std, Y_train_std)
print("Best Params",randomCV.best_params_)
rfbest = randomCV.best_estimator_
print("Best estimators",rfbest)
print("Best train score",rfbest.score(X_train_std,Y_train_std))
print("Best test score",rfbest.score(X_test_std,Y_test_std))
##HYPER PARAMETER TUNING WITH GRID SEARCH CROSS VALIDATION
param_grid = {"max_depth": [3, None],
"max_features": [2, 3, 10],
"min_samples_split": [2, 3, 10],
"min_samples_leaf": [2, 3, 10],
"bootstrap": [True, False],
"criterion": ["gini", "entropy"]}
grid_search = GridSearchCV(clf, param_grid=param_grid)
grid_search.fit(X_train_std, Y_train_std)
print("Best Params",grid_search.best_params_)
RfGsbest=grid_search.best_estimator_
print("Best estimators",RfGsbest)
print("Best train score",RfGsbest.score(X_train_std,Y_train_std))
print("Best test score",RfGsbest.score(X_test_std,Y_test_std))
#D. Use any other technique/method which can enhance the model performance.Hint: Dimensionality reduction, attribute removal, standardisation/normalisation, target balancing etc.
#IDENTIFY FEATURE IMPORTANCE
features = Signaldf2.columns
importances = rfc.feature_importances_
plt.figure(figsize=(20,40))
indices = np.argsort(importances)[-120:]
plt.title('Feature Importances')
plt.barh(range(len(indices)), importances[indices], color='b', align='center')
plt.yticks(range(len(indices)), [features[i] for i in indices])
plt.xlabel('Relative Importance')
plt.show()
Standardization/normalization: already done in an earlier step.
Target imbalance: already handled in earlier steps.
Attribute removal and some dimensionality reduction have already been applied:
1. Low variance filter
2. Highly correlated feature removal
3. Zero variance filter
4. Features with more than 20% missing values were filtered
pca = PCA()
X_train_reduced = pca.fit_transform(X_train_std)
# transform (not refit) the test set with the PCA fitted on the training set
X_test_reduced = pca.transform(X_test_std)
display(X_train_reduced.shape, X_test_reduced.shape)
plt.figure(figsize=(15,10));
plt.step(list(range(1, len(pca.explained_variance_ratio_) + 1)), np.cumsum(pca.explained_variance_ratio_), where='mid')
plt.ylabel('Cumulative variance explained')
plt.xlabel('Number of components')
plt.show()
plt.figure(figsize=(15,10));
plt.step(list(range(1, len(pca.explained_variance_ratio_) + 1)), np.cumsum(pca.explained_variance_ratio_), where='mid')
plt.ylabel('Cumulative variance explained')
plt.xlabel('Number of components')
plt.axhline(y = 0.85, color = 'r', linestyle = '--')
plt.axvline(x = 60, color = 'r', linestyle = '--')
plt.show()
pca = PCA(n_components=60, random_state=14)
X_train_reduced = pca.fit_transform(X_train_std)
# transform (not refit) the test set with the PCA fitted on the training set
X_test_reduced = pca.transform(X_test_std)
display(X_train_reduced.shape, X_test_reduced.shape)
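Instead of reading the component count off the cumulative-variance plot, PCA also accepts a float n_components and itself picks the smallest number of components reaching that variance fraction. A sketch on random data:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(14)
X = rng.normal(size=(200, 30))

# 0.85 asks for the smallest number of components whose
# cumulative explained variance ratio reaches 85%
pca = PCA(n_components=0.85)
X_red = pca.fit_transform(X)
```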
#cross validation after PCA dimensinality reduction to 60 components
param_grid = {'max_depth':[3,5,7,9,11],
'min_samples_leaf':[2,4,5],
'min_samples_split':[2,3,5,6]}
grid = GridSearchCV(clf, param_grid,cv=3,verbose = 3)
grid.fit(X_train_reduced, Y_train_std)
print(grid.best_params_)
pcabest = grid.best_estimator_
y_predict_Grid = pcabest.predict(X_test_reduced)
print("Best estimators",pcabest)
print("Best train score",pcabest.score(X_train_reduced,Y_train_std))
print("Best test score",pcabest.score(X_test_reduced,Y_test_std))
# Parameter tuned - Random forest model with PCA dimensionality reduction
Rf_pca = RandomForestClassifier(max_depth=11, min_samples_leaf=2, n_estimators=50)
Rf_pca.fit(X_train_reduced , Y_train_std)
Y_true, y_pred = Y_test_std, Rf_pca.predict(X_test_reduced) #prediction with test data
Y_traintrue, ytrain_pred = Y_train_std, Rf_pca.predict(X_train_reduced) #prediction with train data
#Training and testing scores
print("Training score: ",Rf_pca.score(X_train_reduced,Y_train_std))
print("Testing score: ",Rf_pca.score(X_test_reduced,Y_true))
#E. Display and explain the classification report in detail
# Classification Report
print('\n{}'.format(classification_report(Y_true, y_pred)))
# Confusion Matrix
cm = confusion_matrix(Y_true, y_pred)
print('\nConfusion Matrix:\n', cm)
df_cm = pd.DataFrame(cm, index = [i for i in [0,1]],columns = [i for i in [0,1]])
plt.figure(figsize = (7,5))
sns.heatmap(df_cm, annot=True ,fmt='g')
plt.show()
# Accuracy Score
acc = accuracy_score(Y_true, y_pred)
print('\nAccuracy Score:\n', round(acc, 3))
RF_roc_auc = roc_auc_score(Y_true, Rf_pca.predict(X_test_reduced))
fpr, tpr, thresholds = roc_curve(Y_true, Rf_pca.predict_proba(X_test_reduced)[:,1])
plt.figure(figsize = (12.8 , 6))
plt.plot(fpr, tpr, label = 'RF classification (area = {})'.\
format(RF_roc_auc.round(2)))
plt.plot([0, 1], [0, 1],'r--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC RF performance')
plt.legend(loc = 'lower right')
plt.show()
Before PCA (121 dimensions):
Training score 1.0, test score 0.9948805460750854.
Precision of class 0 (fail) is 99 and recall is 100.
Precision of class 1 (pass) is 100 and recall is 99.
F1 scores of classes 0 and 1 are 99 and 99 respectively.
After PCA (60 dimensions):
We have reduced the dimensionality by roughly half, to 60 components, reducing overfitting and complexity, but there is a dip in the test score.
Training score: 0.9918803418803419, testing score: 0.7525597269624573.
Precision of class 0 (fail) is 87 and recall is 64: high precision.
Precision of class 1 (pass) is 72 and recall is 90: high recall.
F1 scores of classes 0 and 1 are 74 and 80 respectively.
#F. Apply the above steps for all possible models that you have learnt so far
class BaseModeltraning:
    def GetBasedModel(self):
        print("\n\n------ Defining 8 base models ------\n")
        # build a list of (name, model) tuples
        basedModels = []
        basedModels.append(('LR', LogisticRegression()))
        basedModels.append(('KNN', KNeighborsClassifier(n_neighbors=3)))
        basedModels.append(('NB', GaussianNB()))
        basedModels.append(('SVM', SVC(gamma=0.01, C=100)))
        basedModels.append(('CART', DecisionTreeClassifier(max_depth=11, min_samples_leaf=2)))
        basedModels.append(('AB', AdaBoostClassifier(learning_rate=0.3, n_estimators=500)))
        basedModels.append(('GBM', GradientBoostingClassifier(learning_rate=0.3, n_estimators=500)))
        basedModels.append(('RF', RandomForestClassifier(max_depth=11, min_samples_leaf=2, n_estimators=50)))
        print(basedModels)
        print("\n\n")
        return basedModels

    def Modelvalidation(self, X_train, Y_train, X_test, Y_test, models, seed):
        print("\n------ Model validation ------\n")
        # test options and evaluation metric
        num_folds = 10
        scoring = 'accuracy'
        results = []
        names = []
        trainscore = []
        testscore = []
        for name, model in models:
            kfold = StratifiedKFold(n_splits=num_folds, random_state=seed, shuffle=True)
            cv_results = cross_val_score(model, X_train, Y_train, cv=kfold, scoring=scoring)
            results.append(cv_results)
            names.append(name)
            model.fit(X_train, Y_train)
            Y_true, y_pred = Y_test, model.predict(X_test)              # prediction on test data
            Y_traintrue, ytrain_pred = Y_train, model.predict(X_train)  # prediction on train data
            # training and testing accuracies
            auctest = accuracy_score(Y_true, y_pred)
            testscore.append(round(auctest, 2))
            auctrain = accuracy_score(Y_traintrue, ytrain_pred)
            trainscore.append(round(auctrain, 2))
            msg = "%s: CV-%.3f, Train-%.3f, Test-%.3f" % (name, cv_results.mean(), auctrain, auctest)
            print(msg)
        return names, results, trainscore, testscore
print("\n\n***********Model Building***********\n")
#get the defind models
Basemodel=BaseModeltraning()
bmodels = Basemodel.GetBasedModel()
#get cross validation result of base models
bmnames,bmcvresults,bmtrainscore,bmtestscore = Basemodel.Modelvalidation(X_train_reduced, Y_train_std,X_test_reduced,Y_test_std,bmodels,seed)
#A. Display and compare all the models designed with their train and test accuracies.
def scoringtab(modelset, cvresult, bmtrainscore, bmtestscore):
    print("\n\n------ Base Model Scoring Table ------\n\n")
    scores = []
    names = []
    for r in cvresult:
        scores.append(round(r.mean() * 100, 2))
    for name, model in modelset:
        names.append(name)
    tab = pd.DataFrame({'Modelname': names, 'CV Score': scores,
                        'Train Score': bmtrainscore, 'Test Score': bmtestscore})
    return tab
basemodelscore=scoringtab(bmodels,bmcvresults,bmtrainscore,bmtestscore)
basemodelscore
#B. Select the final best trained model along with your detailed comments for selecting this model.
Overfit models: Random Forest, SVM, AdaBoost and Gradient Boosting are overfit.
Underfit models: LR and KNN; their test accuracies are very low.
Between NB and CART, NB appears to be the best model when comparing cross-validation, train and test scores.
Final best model: Naive Bayes.
finalmodel=GaussianNB()
finalmodel.fit(X_train_reduced , Y_train_std)
Y_true, y_pred = Y_test_std, finalmodel.predict(X_test_reduced) #prediction with test data
Y_traintrue, ytrain_pred = Y_train_std, finalmodel.predict(X_train_reduced) #prediction with train data
#Training and testing scores
print("Training score: ",finalmodel.score(X_train_reduced,Y_train_std))
print("Testing score: ",finalmodel.score(X_test_reduced,Y_true))
# Classification Report
print('\n{}'.format(classification_report(Y_true, y_pred)))
# Confusion Matrix
cm = confusion_matrix(Y_true, y_pred)
print('\nConfusion Matrix:\n', cm)
df_cm = pd.DataFrame(cm, index = [i for i in [0,1]],columns = [i for i in [0,1]])
plt.figure(figsize = (7,5))
sns.heatmap(df_cm, annot=True ,fmt='g')
plt.show()
# Accuracy Score
acc = accuracy_score(Y_true, y_pred)
print('\nAccuracy Score:\n', round(acc, 3))
final_roc_auc = roc_auc_score(Y_true, finalmodel.predict(X_test_reduced))
fpr, tpr, thresholds = roc_curve(Y_true, finalmodel.predict_proba(X_test_reduced)[:,1])
plt.figure(figsize = (12.8 , 6))
plt.plot(fpr, tpr, label = 'Final NB classification (area = {})'.\
format(final_roc_auc.round(2)))
plt.plot([0, 1], [0, 1],'r--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC NB performance')
plt.legend(loc = 'lower right')
plt.show()
#C. Pickle the selected model for future use.
pipeline = Pipeline([
('scl', StandardScaler()),
('pca', PCA(n_components=60)),
('NB', GaussianNB())])
pipeline.fit(X_train,Y_train)
y_predict = pipeline.predict(X_test)
print("Pipeline train score",pipeline.score(X_train, Y_train))
print("Pipeline test score",pipeline.score(X_test, Y_test))
def packthemodel(model):
    print("\n\n------ Pickle the model ------\n")
    return pickle.dumps(model)
picklefile = packthemodel(pipeline)
print("Model serialized to a pickle byte string")
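pickle.dumps returns an in-memory byte string; to persist the model for future use it still has to be written to disk. A round-trip sketch with a small stand-in model (the file path is illustrative; in the notebook this would be the fitted pipeline):

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.naive_bayes import GaussianNB

# small stand-in model; in the notebook this would be the fitted pipeline
model = GaussianNB().fit(np.array([[0.0], [1.0], [2.0], [3.0]]),
                         np.array([0, 0, 1, 1]))

path = os.path.join(tempfile.mkdtemp(), "final_model.pkl")  # illustrative path
with open(path, "wb") as f:
    pickle.dump(model, f)          # write the model to disk
with open(path, "rb") as f:
    restored = pickle.load(f)      # reload it later for scoring
```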
The given dataset had 1567 rows and 592 columns and suffered from the curse of dimensionality.
There were missing values, outliers and zero/low-variance columns in the data.
Dimensionality reduction was therefore essential; we used several techniques (PCA, variance threshold, multicollinear feature removal, etc.).
We reduced the model complexity from 592 dimensions down to 60 components.
We chose the Naive Bayes model out of all the models built on this data.
Train accuracy of the final model: 92.3. Test accuracy of the final model: 86.8.
Classification report:

| class | precision | recall |
|---|---|---|
| 1 | 0.84 | 0.90 |
| 0 | 0.90 | 0.83 |

The pipeline was built with StandardScaler, PCA (60 components) and Naive Bayes.
The pickled model is stored in the variable 'picklefile'.